Guidelines and Framework for a Large Scale Arabic Diacritized Corpus

Authors

  • Wajdi Zaghouani
  • Houda Bouamor
  • Abdelati Hawwari
  • Mona T. Diab
  • Ossama Obeid
  • Mahmoud Ghoneim
  • Sawsan Alqahtani
  • Kemal Oflazer
Abstract

This paper presents the annotation guidelines developed as part of an effort to create a large scale manually diacritized corpus for various Arabic text genres. The target size of the annotated corpus is 2 million words. We summarize the guidelines and describe issues encountered during the training of the annotators. We also discuss the challenges posed by the complexity of the Arabic language and how they are addressed. Finally, we present the diacritization annotation procedure and detail the quality of the resulting annotations.

Similar resources

Statistical Methods for Automatic diacritization of Arabic text

In this paper, the issue of adding diacritics (Tashkeel) to undiacritized Arabic text using statistical methods for language modeling is addressed. The approach requires a large corpus of fully diacritized text for extracting the language monograms, bigrams, and trigrams for words and letters. Search algorithms are then used to find the most probable sequence of diacritized words of a given undiac...

Large Scale Arabic Error Annotation: Guidelines and Framework

We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we...

A Pilot Study on Arabic Multi-Genre Corpus Diacritization

Arabic script writing is typically underspecified for short vowels and other markup, referred to as diacritics. Apart from the lexical ambiguity found in words, similar to that exhibited in other languages, the lack of diacritics in written Arabic script adds another layer of ambiguity which is an artifact of the orthography. Diacritization of written text has a significant impact on Arabic NL...

Correction Annotation for Non-Native Arabic Texts: Guidelines and Corpus

We present our correction annotation guidelines to create a manually corrected non-native (L2) Arabic corpus. We develop our approach by extending an L1 large-scale Arabic corpus and its manual corrections to include manually corrected non-native Arabic learner essays. Our overarching goal is to use the annotated corpus to develop components for automatic detection and correction of language er...

Exploiting Arabic Diacritization for High Quality Automatic Annotation

We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for ...

Publication date: 2016